Principles of Data Visualization and Introduction to ggplot2

I have provided you with data about the 5,000 fastest growing companies in the US, as compiled by Inc. magazine. Let's read this in:

In [9]:
library(ggplot2)
library(tidyr)
library(dplyr)
library(plotly)
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA_608/master/module1/Data/inc5000_data.csv", 
                header= TRUE)

And let's preview this data:

In [10]:
head(inc)
Rank  Name                          Growth_Rate  Revenue    Industry                      Employees  City          State
1     Fuhu                          421.48       1.179e+08  Consumer Products & Services  104        El Segundo    CA
2     FederalConference.com         248.31       4.960e+07  Government Services           51         Dumfries      VA
3     The HCI Group                 245.45       2.550e+07  Health                        132        Jacksonville  FL
4     Bridger                       233.08       1.900e+09  Energy                        50         Addison       TX
5     DataXu                        213.37       8.700e+07  Advertising & Marketing       220        Boston        MA
6     MileStone Community Builders  179.38       4.570e+07  Real Estate                   63         Austin        TX
In [11]:
summary(inc)
      Rank                          Name       Growth_Rate     
 Min.   :   1   (Add)ventures         :   1   Min.   :  0.340  
 1st Qu.:1252   @Properties           :   1   1st Qu.:  0.770  
 Median :2502   1-Stop Translation USA:   1   Median :  1.420  
 Mean   :2502   110 Consulting        :   1   Mean   :  4.612  
 3rd Qu.:3751   11thStreetCoffee.com  :   1   3rd Qu.:  3.290  
 Max.   :5000   123 Exteriors         :   1   Max.   :421.480  
                (Other)               :4995                    
    Revenue                                  Industry      Employees      
 Min.   :2.000e+06   IT Services                 : 733   Min.   :    1.0  
 1st Qu.:5.100e+06   Business Products & Services: 482   1st Qu.:   25.0  
 Median :1.090e+07   Advertising & Marketing     : 471   Median :   53.0  
 Mean   :4.822e+07   Health                      : 355   Mean   :  232.7  
 3rd Qu.:2.860e+07   Software                    : 342   3rd Qu.:  132.0  
 Max.   :1.010e+10   Financial Services          : 260   Max.   :66803.0  
                     (Other)                     :2358   NA's   :12       
            City          State     
 New York     : 160   CA     : 701  
 Chicago      :  90   TX     : 387  
 Austin       :  88   NY     : 311  
 Houston      :  76   VA     : 283  
 San Francisco:  75   FL     : 282  
 Atlanta      :  74   IL     : 273  
 (Other)      :4438   (Other):2764  

Think a bit on what these summaries mean. Use the space below to add some more relevant non-visual exploratory information you think helps you understand this data:

In [12]:
count(inc, Industry)
Industry                      n
Advertising & Marketing       471
Business Products & Services  482
Computer Hardware             44
Construction                  187
Consumer Products & Services  203
Education                     83
Energy                        109
Engineering                   74
Environmental Services        51
Financial Services            260
Food & Beverage               131
Government Services           202
Health                        355
Human Resources               196
Insurance                     50
IT Services                   733
Logistics & Transportation    155
Manufacturing                 256
Media                         54
Real Estate                   96
Retail                        203
Security                      73
Software                      342
Telecommunications            129
Travel & Hospitality          62
In [13]:
count(inc, State)
State n
AK 2
AL 51
AR 9
AZ 100
CA 701
CO 134
CT 50
DC 43
DE 16
FL 282
GA 212
HI 7
IA 28
ID 17
IL 273
IN 69
KS 38
KY 40
LA 37
MA 182
MD 131
ME 13
MI 126
MN 88
MO 59
MS 12
MT 4
NC 137
ND 10
NE 27
NH 24
NJ 158
NM 5
NV 26
NY 311
OH 186
OK 46
OR 49
PA 164
PR 1
RI 16
SC 48
SD 3
TN 82
TX 387
UT 95
VA 283
VT 6
WA 130
WI 79
WV 2
WY 2
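
As one further non-visual check, the raw counts can be turned into shares to see how concentrated the list is. The following is a minimal sketch in base R, not part of the original notebook: the six counts are copied from the `count(inc, State)` output above, and the total of 5001 is taken from adding up the State column of `summary(inc)` (including the "(Other)" row). The variable names are made up for illustration.

```r
# Six largest state counts, copied from the count(inc, State) output above.
state_n <- c(CA = 701, TX = 387, NY = 311, VA = 283, FL = 282, IL = 273)

# Total rows: sum of the State column in summary(inc), including "(Other)".
total <- 5001

# Share of all firms per state, and the running (cumulative) share, in percent.
share_pct <- round(100 * state_n / total, 1)
cum_pct <- round(100 * cumsum(state_n) / total, 1)

state_share <- data.frame(
  State = names(state_n),
  n = as.vector(state_n),
  share_pct = as.vector(share_pct),
  cum_pct = as.vector(cum_pct)
)
print(state_share)
```

With the full data frame loaded, the same idea could be applied to every state by computing `n / sum(n)` on the result of `count(inc, State)`.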
In [14]:
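# Plotly cloud credentials are only needed to publish charts online, not to render them locally.
Sys.setenv("plotly_username"="AsherMeyers")
# API key redacted: set plotly_api_key outside the notebook (e.g. in ~/.Renviron) instead of hard-coding it.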

Question 1

Create a graph that shows the distribution of companies in the dataset by State (i.e., how many are in each state). There are a lot of States, so consider which axis you should use. This visualization is ultimately going to be consumed on a 'portrait' oriented screen (i.e., taller than wide), which should further guide your layout choices.

In [15]:
ftable = table(inc$State)
ftable = ftable[order(ftable)]
ftable
 PR  AK  WV  WY  SD  MT  NM  VT  HI  AR  ND  MS  ME  DE  RI  ID  NH  NV  NE  IA 
  1   2   2   2   3   4   5   6   7   9  10  12  13  16  16  17  24  26  27  28 
 LA  KS  KY  DC  OK  SC  OR  CT  AL  MO  IN  WI  TN  MN  UT  AZ  MI  WA  MD  CO 
 37  38  40  43  46  48  49  50  51  59  69  79  82  88  95 100 126 130 131 134 
 NC  NJ  PA  MA  OH  GA  IL  FL  VA  NY  TX  CA 
137 158 164 182 186 212 273 282 283 311 387 701 
In [16]:
ftdf <- data.frame(ftable)
colnames(ftdf) <- c("State", "Frequency")
In [24]:
# Dot plot with one row per state (sorted by count), so the many state labels
# run down the long side of a portrait screen. The x axis is logarithmic.
p <- plot_ly(ftdf, x = ~Frequency, y = ~State, color = I("gray80")) %>%
  add_markers(color = I("Navy")) %>%
  layout(
    title = "# of 5,000 Fastest Growing Firms, by State",
    xaxis = list(title = "Number of Firms (log scale)", type = "log"),
    yaxis = list(dtick = 1.95),
    margin = list(l = 40)
  )
p

#Source: https://plot.ly/r/dumbbell-plots/
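
For comparison, the same portrait-friendly layout can be sketched in ggplot2, which is loaded at the top of the notebook but not used in the cells above. This is a hedged illustration rather than the notebook's own solution: `top_states` is a hypothetical six-row subset whose counts are copied from the state table above, and `coord_flip()` is one of several ways to put the states on the vertical axis.

```r
library(ggplot2)

# Hypothetical subset: the six largest state counts, copied from the table above.
top_states <- data.frame(
  State = c("CA", "TX", "NY", "VA", "FL", "IL"),
  Frequency = c(701, 387, 311, 283, 282, 273)
)

# reorder() sorts the bars by count instead of alphabetically;
# coord_flip() moves the states to the vertical axis for a tall, narrow screen.
g <- ggplot(top_states, aes(x = reorder(State, Frequency), y = Frequency)) +
  geom_col(fill = "navy") +
  coord_flip() +
  labs(
    title = "Fastest Growing Firms by State (top six)",
    x = "State",
    y = "Number of Firms"
  ) +
  theme_minimal()
g
```

Passing the full `ftdf` data frame instead of `top_states` should give one bar per state, at which point a taller figure height becomes important.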
In [75]:
# treemap
#install.packages('treemap', repos = "http://cran.us.r-project.org")
library(treemap)
ftdf$Display <- paste0(ftdf$State, ": ", as.character(ftdf$Frequency))
treemap(ftdf,
        index="Display",
        vSize="Frequency",
        type="index",
        title = "5,000 Fastest Growing Firms by State",
        palette = "Set3")

Question 2

Let's dig in on the state with the third most companies in the data set. Imagine you work for the state and are interested in how many people are employed by companies in different industries. Create a plot that shows the average and/or median employment by industry for companies in this state (only use cases with full data; use R's complete.cases() function). In addition to this, your graph should show how variable the ranges are, and you should deal with outliers.

In [87]:
#install.packages("plotly", repos = "http://cran.us.r-project.org")
In [17]:
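# New York has the third most companies (311). Keep only complete rows for NY,
# then compute a five-number summary of employees for each industry.
incNY <- inc %>%
    filter(complete.cases(.), State == "NY") %>%
    group_by(Industry) %>%
    summarise(low = min(Employees), firstQ = quantile(Employees, 0.25),
              median = median(Employees), thirdQ = quantile(Employees, 0.75),
              high = max(Employees))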

incNY$Industry <- factor(incNY$Industry, levels = incNY$Industry[order(incNY$thirdQ)])

# Dumbbell-style plot: a gray segment spans the middle 50% of each industry,
# with markers for the minimum, quartiles, median and maximum.
# The log-scaled x axis keeps very large firms from compressing the other points.
p <- plot_ly(incNY, color = I("gray80")) %>%
  add_segments(x = ~firstQ, xend = ~thirdQ, y = ~Industry, yend = ~Industry, name = "Middle 50%") %>%
  add_markers(x = ~low, y = ~Industry, name = "Low", color = I("Red")) %>%
  add_markers(x = ~high, y = ~Industry, name = "High", color = I("Green")) %>%
  add_markers(x = ~firstQ, y = ~Industry, name = "25th Percentile", color = I("orange")) %>%
  add_markers(x = ~thirdQ, y = ~Industry, name = "75th Percentile", color = I("blue")) %>%
  add_markers(x = ~median, y = ~Industry, name = "Median", color = I("black")) %>%
  layout(
    title = "Employees per Firm in New York, by Industry",
    xaxis = list(title = "Number of Employees per Firm (log scale)", type = "log"),
    yaxis = list(dtick = 1),
    margin = list(l = 200)
  )
p

#Source: https://plot.ly/r/dumbbell-plots/
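
The plot above handles extreme values by using a log scale and showing the minimum and maximum as separate markers. Another common way to deal with outliers is to flag them explicitly with the 1.5 × IQR rule (the rule behind boxplot whiskers). The sketch below uses base R only; the `employees` vector is hypothetical and not taken from the dataset.

```r
# Hypothetical employee counts for a single industry (illustrative values only).
employees <- c(12, 25, 40, 53, 60, 85, 132, 3000)

# Quartiles and interquartile range.
q <- quantile(employees, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences: anything beyond 1.5 * IQR from the quartiles is flagged as an outlier.
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr
is_outlier <- employees < lower | employees > upper

print(employees[is_outlier])           # flagged values
print(median(employees))               # median with all firms included
print(median(employees[!is_outlier]))  # median after dropping flagged firms
```

Because the median barely moves when one extreme value is dropped, it is usually a safer summary than the mean for data like this.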

Question 3

Now imagine you work for an investor and want to see which industries generate the most revenue per employee. Create a chart that makes this information clear. Once again, the distribution per industry should be shown.

In [3]:
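library(dplyr)
library(plotly)
# Revenue per employee in thousands of dollars ($K), computed per firm and then
# summarised per industry with the same five-number summary as in Question 2.
incRev <- inc %>%
    filter(complete.cases(.)) %>%
    group_by(Industry) %>%
    mutate(RevenuePerEmp = round(Revenue/Employees/1000)) %>%
    summarise(low = min(RevenuePerEmp), firstQ = quantile(RevenuePerEmp, 0.25),
              median = median(RevenuePerEmp), thirdQ = quantile(RevenuePerEmp, 0.75),
              high = max(RevenuePerEmp))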

incRev$Industry <- factor(incRev$Industry, levels = incRev$Industry[order(incRev$thirdQ)])


# Same dumbbell-style layout as in Question 2, applied to revenue per employee.
p <- plot_ly(incRev, color = I("gray80")) %>%
  add_segments(x = ~firstQ, xend = ~thirdQ, y = ~Industry, yend = ~Industry, name = "Middle 50%") %>%
  add_markers(x = ~low, y = ~Industry, name = "Low", color = I("Red")) %>%
  add_markers(x = ~firstQ, y = ~Industry, name = "25th Percentile", color = I("orange")) %>%
  add_markers(x = ~thirdQ, y = ~Industry, name = "75th Percentile", color = I("blue")) %>%
  add_markers(x = ~high, y = ~Industry, name = "High", color = I("Green")) %>%
  add_markers(x = ~median, y = ~Industry, name = "Median", color = I("black")) %>%
  layout(
    title = "Revenue per Employee, by Industry",
    xaxis = list(title = "Annual Revenue per Employee in $K (log scale)", type = "log"),
    yaxis = list(dtick = 1),
    margin = list(l = 250)
  )
p

#Source: https://plot.ly/r/dumbbell-plots/
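
Note that "revenue per employee" for an industry can be read in two ways: the median of each firm's own ratio (what the cell above summarises), or total industry revenue divided by total industry employees. The two can differ a lot when one large firm dominates. The sketch below illustrates the difference in base R; the `firms` data frame and its values are entirely hypothetical.

```r
# Hypothetical firms in two made-up industries (values are illustrative only).
firms <- data.frame(
  Industry  = c("A", "A", "A", "B", "B"),
  Revenue   = c(2e6, 4e6, 9e7, 5e6, 1e7),
  Employees = c(10, 20, 100, 50, 100)
)

# Per-firm revenue per employee, in thousands of dollars.
firms$RevPerEmp <- firms$Revenue / firms$Employees / 1000

# Reading 1: median of the per-firm ratios within each industry.
median_ratio <- tapply(firms$RevPerEmp, firms$Industry, median)

# Reading 2: industry total revenue divided by industry total employees.
total_ratio <- tapply(firms$Revenue, firms$Industry, sum) /
  tapply(firms$Employees, firms$Industry, sum) / 1000

print(data.frame(
  Industry = names(median_ratio),
  median_of_ratios = as.vector(median_ratio),
  ratio_of_totals = as.vector(total_ratio)
))
```

In this toy example industry A looks far more productive under the second reading because a single large firm carries most of the revenue, which is one reason to show the full distribution rather than a single number.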